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The Random Walk Metropolis: Linking 
Theory and Practice Through a Case 
Study 

Chris Sherlock, Paul Fearnhead and Gareth O. Roberts 



Abstract. The random walk Metropolis (RWM) is one of the most 
common Markov chain Monte Carlo algorithms in practical use today. 
Its theoretical properties have been extensively explored for certain 
classes of target, and a number of results with important practical im- 
plications have been derived. This article draws together a selection of 
new and existing key results and concepts and describes their implica- 
tions. The impact of each new idea on algorithm efficiency is demon- 
strated for the practical example of the Markov modulated Poisson 
process (MMPP). A reparameterization of the MMPP which leads to a 
highly efficient RWM-within-Gibbs algorithm in certain circumstances 
is also presented. 

Key words and phrases: Random walk Metropolis, Metropolis-Hastings, 
MCMC, adaptive MCMC, MMPP. 



1. INTRODUCTION 

Markov chain Monte Carlo (MCMC) algorithms 
provide a framework for sampling from a target ran- 
dom variable with a potentially complicated proba- 
bility distribution tt(-) by generating a Markov chain 
xWjX*- 2 ), . . . with stationary distribution 7r(-). The 
single most widely used subclass of MCMC algo- 
rithms is based around the random walk Metropolis 
(RWM). 

Theoretical properties of RWM algorithms for cer- 
tain special classes of target have been investigated 
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extensively. Reviews of RWM theory have, for exam- 
ple, dealt with optimal scaling and posterior shape 
(Roberts and Rosenthal, 2001), and convergence 
(Roberts, 2003). This article does not set out to 
be a comprehensive review of all theoretical results 
pertinent to the RWM. Instead the article reviews 
and develops specific aspects of the theory of RWM 
efficiency in order to tackle an important and diffi- 
cult problem: inference for the Markov modulated 
Poisson process (MMPP). It includes sections on 
RWM within Gibbs, hybrid algorithms, and adap- 
tive MCMC, as well as optimal scaling, optimal shap- 
ing, and convergence. A strong emphasis is placed 
on developing an intuitive understanding of the pro- 
cesses behind the theoretical results, and then on us- 
ing these ideas to improve the implementation. All 
of the RWM algorithms described in this article are 
tested against datasets arising from MMPPs. Real- 
ized changes in efficiency are then compared with 
theoretical predictions. 

Observed event times of an MMPP arise from a 
Poisson process whose intensity varies with the state 
of an unobserved continuous-time Markov chain. The 
MMPP has been used to model a wide variety of 
clustered point processes, for example, requests for 
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web pages from users of the World Wide Web (Scott 
and Smyth, 2003), arrivals of photons from single- 
molecule fluorescence experiments (Burzykowski, 
Szubiakowski and Ryden, 2003; Kou, Xie and Liu, 
2005), and occurrences of a rare DNA motif along a 
genome (Fearnhead and Sherlock, 2006). 

In common with mixture models and other hidden 
Markov models, inference for the MMPP is greatly 
complicated by a lack of knowledge of the hidden 
data. The likelihood function often possesses many 
minor modes since the data might be approximately 
described by a hidden process with fewer states. For 
this same reason the likelihood often does not ap- 
proach zero as certain combinations of parameters 
approach zero and /or infinity and so improper priors 
lead to improper posteriors (e.g., Sherlock, 2005). 
Further, as with many hidden data models the like- 
lihood is invariant under permutation of the states, 
and this "labeling" problem leads to posteriors with 
several equal modes. 

This article focuses on generic concepts and tech- 
niques for improving the efficiency of RWM algo- 
rithms whatever the statistical model. The MMPP 
provides a nontrivial testing ground for them. All 
of the RWM algorithms described in this article are 
tested against two simulated MMPP datasets with 
very different characteristics. This allows us to demon- 
strate the influence on performance of posterior at- 
tributes such as shape and orientation near the mode 
and lightness or heaviness of tails. 

Section 2 introduces RWM algorithms and then 
describes theoretical and practical measures of al- 
gorithm efficiency. Next the two main theoretical 
approaches to determining efficiency are described, 
and the section ends with a brief overview of the 
MMPP and a description of the data analyzed in 
this article. Section 3 introduces a series of con- 
cepts which allow potential improvements in the ef- 
ficiency of a RWM algorithm. The intuition behind 
each concept is described, followed by theoretical 
justification and then details of one or more RWM 
algorithms motivated by the theory. Actual results 
are described and compared with theoretical predic- 
tions in Section 4, and the article is summarized in 
Section 5. 

2. BACKGROUND 

In this section we introduce the background ma- 
terial on which the remainder of this article draws. 
We describe the random walk Metropolis algorithm 



and a variation, the random walk Metropolis-within- 
Gibbs. Both practical issues and theoretical 
approaches to algorithm efficiency are then discussed. 
We conclude with an introduction to the Markov 
modulated Poisson process and to the datasets used 
later in the article. 

2.1 Random Walk Metropolis Algorithms 

The random walk Metropolis (RWM) updating 
scheme was first applied by Metropolis et al. (1953) 
and proceeds as follows. Given a current value of the 
ci-dimensional Markov chain, X, a new value X* is 
obtained by proposing a jump Y* := X* — X from 
the prespecified Lebesgue density 

with r(y) = r(— y) for all y. Here A > governs the 
overall size of the proposed jump and (see Section 
3.1) plays a crucial role in determining the efficiency 
of any algorithm. The proposal is then accepted or 
rejected according to acceptance probability 

(2) a ( x , y *) = min(l,^±p). 

If the proposed value is accepted it becomes the next 
current value (X' <— X + Y*); otherwise the current 
value is left unchanged (X' <— X). 

An intuitive interpretation of the above formula 
is that "uphill" proposals (proposals which take the 
chain closer to a local mode) are always accepted, 
whereas "downhill" proposals are accepted with prob- 
ability exactly equal to the relative "heights" of the 
posterior at the proposed and current values. It is 
precisely this rejection of some "downhill" propos- 
als which acts to keep the Markov chain in the main 
posterior mass most of the time. 

More formally, denote by P(x, •) the transition 
kernel of the chain, which represents the combined 
process of proposal and acceptance/rejection lead- 
ing from one element of the chain (x) to the next. 
The acceptance probability (2) is chosen so that the 
chain is reversible at equilibrium with stationary 
distribution vr(-). Reversibility [that 7r(x)P(x,x') = 
tt(x')P(x' , x)] is an important property precisely be- 
cause it is so easy to construct reversible chains 
which have a prespecified stationary distribution. It 
is also possible to prove a slightly stronger central 
limit theorem for reversible (as opposed to nonre- 
versible) geometrically ergodic chains (e.g., Section 
2.2.1). 
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We now describe a generalization of the RWM 
which acts on a target whose components have been 
split into k sub-blocks. In general we write X = 
(Xi, . . . , Xfc), where Xj is the ith sub-block of com- 
ponents of the current element of the chain. Starting 
from value X, a single iteration of this algorithm cy- 
cles through all of the sub-blocks updating each in 
turn. It will therefore be convenient to define the 
shorthand 

(B) ._ / / 
Xj ■ — x 1j ■ • ■ j x j_ 1) x j, x i+l 5 ■ • • ; x fc) 

(B)* / / _i_ * 

x j x i, ■ • ■ , x j_i, x i T Vj , Xj_|_i , . . . ,Xfc, 

where x^- is the updated value of the jth sub-block. 
For the ith sub-block a jump Y* is proposed from 
symmetric density fj(y; A,) and accepted or rejected 
according to acceptance probability 7r(x^*)/ 

7r(x^ ). Since this algorithm is in fact a generaliza- 
tion of both the RWM and the Gibbs sampler (for 
a description of the Gibbs sampler see, e.g., Gamer- 
man and Lopes, 2006) we follow, for example, Neal 
and Roberts (2006) and call this the random walk 
Metropolis- within- Gibbs or RWM-within-Gibbs. 
The most commonly used random walk 
Metropolis- within-Gibbs algorithm, and also the sim- 
plest, is that employed in this article: here all blocks 
have dimension 1 so that each component of the pa- 
rameter vector is updated in turn. 

As mentioned earlier in this section, the RWM is 
reversible; but even though each stage of the RWM- 
within-Gibbs is reversible, the algorithm as a whole 
is not. Reversible variations include the random scan 
RWM-within-Gibbs, wherein at each iteration a sin- 
gle component is chosen at random and updated 
conditional on all the other components. 

Convergence of the Markov chain to its stationary 
distribution can be guaranteed for all of the above 
algorithms under quite general circumstances (e.g., 
Gilks, Richardson and Spiegelhalter, 1996). 

2.2 Algorithm Efficiency 

Consecutive draws of an MCMC Markov chain 
are correlated and the sequence of marginal distri- 
butions converges to vr(-). Two main (and related) 
issues arise with regard to the efficiency of MCMC 
algorithms: convergence and mixing. 

2.2.1 Convergence In this article we will be con- 
cerned with practical determination of a point at 
which a chain has converged. The method we em- 
ploy is simple heuristic examination of the trace 



plots for the different components of the chain. Note 
that since the state space is multidimensional it is 
not sufficient to simply examine a single component. 
Alternative techniques are discussed in Chapter 7 
of the book by Gilks, Richardson and Spiegelhalter 
(1996). 

Theoretical criteria for ensuring convergence (er- 
godicity) of MCMC Markov chains are examined in 
detail in Chapters 3 and 4 of the book by Gilks, 
Richardson and Spiegelhalter (1996) and references 
therein, and will not be discussed here. We do, how- 
ever, wish to highlight the concepts of geometric and 
polynomial ergodicity. A Markov chain with transi- 
tion kernel P is geometrically ergodic with station- 
ary distribution n(-) if 

(3) ||P™(x,.)-vr(.)||i<M(xK 

for some positive r < 1 and M(-) > 0; if M(-) is 
bounded above, then the chain is uniformly ergodic. 
Here ||-F(-) — G(-)||i denotes the total variational dis- 
tance between measures F(-) and G(-) (see, e.g., 
Meyn and Tweedie, 1993), and P n is the n-step 
transition kernel. Efficiency of a geometrically er- 
godic algorithm is measured by the geometric rate 
of convergence, r, which over a large number of it- 
erations is well approximated by the second largest 
eigenvalue of the transition kernel [the largest eigen- 
value being 1, and corresponding to the station- 
ary distribution 7r(-)]. Geometric ergodicity is usu- 
ally a purely qualitative property since in general 
the constants M(x) and r are not known. Crucially 
for practical MCMC, however, any geometrically er- 
godic reversible Markov chain satisfies a central limit 
theorem for all functions with finite second moment 
with respect to vr(-). Thus there is a a 2 < oo such 
that 

(4) n^ 2 (f n -E n [f(X)])^N(0,a 2 f ), 

where => denotes convergence in distribution. The 
central limit theorem (4) not only guarantees con- 
vergence of the Monte Carlo estimate (5) but also 
supplies its standard error, which decreases as n -1 / 2 . 

When the second largest eigenvalue is also 1, a 
Markov chain is termed polynomially ergodic if 

||P"(x,.)-^(-)lli<M( x )™- r . 

Clearly polynomial ergodicity is a weaker condition 
than geometric ergodicity. Central limit theorems 
for polynomially ergodic MCMC are much more del- 
icate; see the article by Jarner and Roberts (2002) 
for details. 
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In this article a chain is referred to as having 
"reached stationarity" or "converged" when the dis- 
tribution from which an element is sampled is as 
close to the stationary distribution as to make no 
practical difference to any Monte Carlo estimates. 

An estimate of the expectation of a given function 
f{X), which is more accurate than a naive Monte 
Carlo average over all the elements of the chain, is 
likely to be obtained by discarding the portion of 
the chain Xo, . • ■ , X m up until the point at which it 
was deemed to have reached stationarity; iterations 
1, . . . , m are commonly termed "burn in." Using only 
the remaining elements X m+ i, . . . , X m+n (with m + 
n = N) our Monte Carlo estimator becomes 

1 m+n 

(5) /„ := ± £ /(Xi). 

m+1 

Convergence and burn in are not discussed any fur- 
ther here, and for the rest of this section the chain 
is assumed to have started at stationarity and con- 
tinued for n further iterations. 

2.2.2 Practical measures of mixing efficiency For 
a stationary chain, Xo is sampled from vr(-), and so 
for all k > and i > 

Cov[/(X fc ),/(X fc+i )] = Cov[/(X ),/(X i )]. 

This is the autocorrelation at lag i. Therefore at 
stationarity, from the definition in (4), 

a 2 * := lim nVar[/ n ] 

oo 

= Var[/(X )] + 2 £ Cov[/(X ), /(X<)] 

i=X 

provided the sum exists (e.g., Geyer, 1992). If el- 
ements of the stationary chain were independent, 
then a 2 r would simply be Var[/(Xo)] and so a mea- 
sure of the inefficiency of the Monte Carlo estimate 
f n relative to the perfect i.i.d. sample is 

a 2 00 

(6) ^i7fe)I = 1+2 g OOTr l« x <>^( x <>l- 

This is the integrated autocorrelation time (ACT) 
and represents the effective number of dependent 
samples that is equivalent to a single independent 
sample. Alternatively n* = n/ACT may be regarded 
as the effective equivalent sample size if the elements 
of the chain had been independent. 

To estimate the ACT in practice one might exam- 
ine the chain from the point at which it is deemed 



to have converged and estimate the lag-i autocorre- 
lation Corr[/(X ),/(Xi)] by 

^ n—i 

(7) 7, = ^(/(X,-) " fn)(f(*^) ~ fn)- 

Naively, substituting these into (6) gives an estimate 
of the ACT. However, contributions from all terms 
with very low theoretical autocorrelation in a real 
run are effectively random noise, and the sum of 
such terms can dominate the deterministic effect in 
which we are interested (e.g., Geyer, 1992). For this 
article we employ the simple solution suggested by 
Carlin and Louis (2009): the sum (6) is truncated 
from the first lag, I, for which the estimated auto- 
correlation drops below 0.05. This gives the (slightly 
biased) estimator 

2-1 

(8) ACT cst :=l + 2^ 7i . 

i=l 

Given the potential for relatively large variance in 
estimates of integrated ACT howsoever they might 
be obtained (e.g., Sokal, 1997), this simple estima- 
tor should be adequate for comparing the relative 
efficiencies of the different algorithms in this article. 
Geyer (1992) provided a number of more complex 
window estimators and provided references for reg- 
ularity conditions under which they are consistent. 

A given run will have a different ACT associated 
with each parameter. An alternative efficiency mea- 
sure, which is aggregated over all parameters, is pro- 
vided by the Mean Squared Euclidean Jump Dis- 
tance (MSEJD) 

1 n— 1 

J Euc •- 1 / Jl x x Il2- 

i=\ 

The expectation of this quantity at stationarity is re- 
ferred to as the Expected Squared Euclidean Jump 
Distance (ESEJD). Consider a single component of 
the target with variance af := VarLYj] = Var[X-], 
and note that EfA^' — Xj\ = 0, so 

E[(Xl - Xi) 2 } = VarK - X { ] 

= 2<j 2 (l-Corr[Xi,Xl]). 

Thus when the chain is stationary and the posterior 
variance is finite, maximizing the ESEJD is equiva- 
lent to minimizing a weighted sum of the lag-1 au- 
tocorrelations. 

If the target has finite second moments and is 
roughly elliptical in shape with (known) covariance 
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matrix S, then an alternative measure of efficiency 
is the Mean Squared Jump Distance (MSJD) 



1 



n 



n-l 



(x (m)_ x » ) * 5] -i (x ( J+ i)_ x 



which is proportional to the unweighted sum of the 
lag-1 autocorrelations over the principal components 
of the ellipse. The theoretical expectation of the 
MSJD at stationarity is known as the expected 
squared jump distance (ESJD). 

Figure 1 shows traceplots for three different 
Markov chains. Estimates of the autocorrelation from 
lag-0 to lag-40 for each Markov chain appear along- 
side the corresponding traceplot. The simple win- 
dow estimator for integrated ACT provides estimates 
of, respectively, 39.7, 5.5, and 35.3. The MSEJDs 
are, respectively, 0.027, 0.349, and 0.063, and are 
equal to the MSJDs since the stationary distribu- 
tion has a variance of 1. 

2.2.3 Assessing accuracy An MCMC algorithm 
might efficiently explore an unimportant part of the 
parameter space and never find the main posterior 
mass. ACT's will be low, therefore, but the resulting 
posterior estimate will be wildly inaccurate. In most 
practical examples it is not possible to determine the 
accuracy of the posterior estimate, though consis- 
tency between several independent runs or between 
different portions of the same run can be tested. 

For the purposes of this article it was important to 
have a relatively accurate estimate of the posterior, 
not determined by a RWM algorithm. Fearnhead 
and Sherlock (2006) detailed a Gibbs sampler for 
the MMPP; this Gibbs sampler was run for 100,000 
iterations on each of the datasets analyzed in this 
article. A "burn in" of 1000 iterations was allowed 
for, and a posterior estimate from the last 99,000 
iterations was used as a reference for comparison 
with posterior estimates from RWM runs of 10,000 
iterations (after burn in). 

2.2.4 Theoretical approaches for algorithm 
efficiency To date, theoretical results on the effi- 
ciency of RWM algorithms have been obtained 
through two very different approaches. We wish to 
quote, explain, and apply theory from both and so 
we give a heuristic description of each and define as- 
sociated notation. Both approaches link some mea- 
sure of efficiency to the expected acceptance rate — 
the expected proportion of proposals accepted at 
stationarity. 



The first approach was pioneered by Roberts, Gel- 
man and Gilks (1997) for targets with independent 
identically distributed components and then gener- 
alized by Roberts and Rosenthal (2001) to targets 
of the form 



n(x) = f[C i f(C i x i ). 



The inverse scale parameters, Cj, are assumed to 
be drawn from some distribution with a given (fi- 
nite) mean and variance. A single component of the 
ci-dimensional chain (without loss of generality the 
first) is then examined; at iteration i of the algo- 
rithm it is denoted x[*v . A scaleless, speeded up, 
continuous-time process which mimics the first com- 
ponent of the chain is defined as 



(d) .. 



(d) 



[td}> 



where [u] denotes the nearest integer less than or 
equal to u. Finally, proposed jumps are assumed to 
be Gaussian 

Y (d) ~N(0,\ 2 d I). 

Subject to conditions on the first two deriviatives of 
/(•), Roberts and Rosenthal (2001) showed that if 
E[Ci) = 1 and E[C 2 } = b, and provided X d = fi/d 1 / 2 
for some fixed fi (the scale parameter but "rescaled" 
according to dimension), then as d — >oo, ap- 
proaches a Langevin diffusion process with speed 



■ad 



(9) 



where a d '■= 25> 



Here <&(x) is the cumulative distribution function of 
a standard Gaussian, J : = E[((log f)') 2 ] is a measure 
of the roughness of the target, and eld corresponds 
to the acceptance rate. 

Bedard (2007) proved a similar result for a tri- 
angular sequence of inverse scale parameters Cj^, 
which are assumed to be known. A necessary and 
sufficient condition equivalent to (11) below is at- 
tached to this result. In effect this requires the scale 
over which the smallest component varies to be "not 
too much smaller" than the scales of the other com- 
ponents. 

The second technique (e.g., Sherlock and Roberts, 
2009) uses expected squared jump distance (ESJD) 
as a measure of efficiency. Exact analytical forms for 
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(a) Chain: lambda=0.24 (b) Chain: lambda=2.4 (c) Chain: lambda=24 




10 20 30 40 10 20 30 40 10 20 30 40 

Lag Lag Lag 

Fig. 1. Traceplots [(&), (b), and (c)J and corresponding autocorrelation plots [(d), (e), and (I)], for exploration of a 
standard Gaussian initialized from x = and using the random walk Metropolis algorithm with Gaussian proposal for 1000 
iterations. Proposal scale parameters for the three scenarios are, respectively, (a.) and (A) 0.24, (b) and (e) 2.4, and (c) and 
(i) 24. 



ESJD (denoted S^) and expected acceptance rate 
are derived for any unimodal elliptically symmetric 
target and any proposal density. Many standard se- 
quences of d-dimensional targets id = 1, 2, . . .), such 
as the Gaussian, satisfy the condition that as d — > 
oo the probability mass becomes concentrated in a 
spherical shell which itself becomes infinitesimally 
thin relative to its radius. Thus the random walk 
on a rescaling of the target is, in the limit, effec- 
tively confined to the surface of this shell. Sherlock 
and Roberts (2009) considered a sequence of tar- 
gets which satisfies such a "shell" condition, and 
a sequence of proposals which satisfies a slightly 
stronger condition. Specifically it is required that 
there exist sequences of positive real numbers, {k x } 



and {ky }, such that 



|XW| 



k 



(d) 



1 and 



||YW|| 

Xdky 



1. 



For such combinations of target and proposal, as 
d — > oo 

d 



(10) 



k 



(d) 



fi 2 a d 



with ad(ft) :- 



2*( 



Here ad is the limiting expected acceptance rate, 
and fi := d 1 / 2 X^ky / kx ■ For target and proposal 
distributions with independent components, such as 
are used in the diffusion results, k^ = ky = d 1 / 2 , 
and hence (consistently) \i = d l l 2 \d- 
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It is also required that the elliptical target not be 
too eccentric. Specifically, for a sequence of target 
densities 7r rf (x) := fd(Y^i=i c i,d x i) ( for some appro- 
priate sequence of functions {fd}) 

maxj c 2 d 

(11) — j '■ > as d — > oo. 

Theoretical results from the two techniques are 
remarkably similar and as will be seen, lead to iden- 
tical strategies for optimizing algorithm efficiency. 
It is worth noting, however, that results from the 
first approach apply only to targets with indepen- 
dent components and results from the second only to 
targets which are unimodal and elliptically symmet- 
ric. That they lead to identical strategies indicates 
a certain potential robustness of these strategies to 
the form of the target. This potential, as we shall 
see, is borne out in practice. 

2.3 The Markov Modulated Poisson Process 

Let Xt be a continuous-time Markov chain on dis- 
crete state space {1, . . . ,d} and let if) := [t/>i , . . . , tpd] 
be a d-dimensional vector of (nonnegative) intensi- 
ties. The linked but stochastically independent Pois- 
son process Yt whose intensity is ipx t is a Markov 
modulated Poisson process — it is a Poisson process 
whose intensity is modulated by a continuous-time 
Markov chain. 

The idea is best illustrated through two exam- 
ples, which also serve to introduce the notation and 
datasets that will be used throughout this article. 
Consider a two-dimensional Markov chain Xt with 
generator Q with q\2 = q%\ = 1. 

Figure 2 shows realizations from two such chains 
over a period of 10 seconds. Now consider a Pois- 
son process Yt which has intensity 10 when Xt is in 
state 1 and intensity 30 when Xt is in state 2. This is 
an MMPP with event intensity vector i/> = [10,30]. 
A realization (obtained via the realization of Xt) is 
shown as a rug plot underneath the chain in the up- 
per graph. The lower graph shows a realization from 
an MMPP with event intensities [10, 17]. 

It can be shown (e.g., Fearnhead and Sherlock, 
2006) that the likelihood for data from an MMPP 
which starts from a distribution v over its states is 

L(Q,*,t) 

(12) 

= j/gCQ-*)*! ^ . . . e (Q-*)t„^ e (Q-*)tn+il. 

Here \l/ := diag(i/'), 1 is a vector of l's, n is the 
number of observed events, t\ is the time from the 
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Fig. 2. Two 2-state continuous-time Markov chains simu- 
lated for 10 seconds from generator Q with gi2 = Q2i = 1; the 
rug plots show events from an MMPP simulated from these 
chains, with intensity vectors tp = [10,30] (upper graph) and 
ip — [10, 17] (lower graph). 

start of the observation window until the first event, 
t n+ i is the time from the last event until the end of 
the observation window, and tj(2 < i < n) is the time 
between the {i — l)th and ith events. In the absence 
of further information, the initial distribution v is 
often taken to be the stationary distribution of the 
underlying Markov chain. 

The likelihood of an MMPP is invariant to a re- 
labeling of the states. Hence if the prior is simi- 
larly invariant, then so too is the posterior: if the 
posterior for a two-dimensional MMPP has a mode 
at (^i, ip2, Q12, 921 )) then it has an identical mode 
at (ip2, ipi, Q21, 512) • In this article our overriding in- 
terest is in the efficiency of the MCMC algorithms 
rather than the exact meaning of the parameters 
and so we choose the simplest solution to this iden- 
tifiability problem: the state with the lower Poisson 
intensity ip is always referred to as state 1. 

2.3.1 MMPP data in this article The two datasets 
of event times used in this article arose from two in- 
dependent MMPP's simulated over an observation 
window of 100 seconds. Both underlying Markov 
chains have q\2 = q2i = 1; dataset Dl has event in- 
tensity vector ijj = [10, 30] whereas dataset D2 has 
ip = [10,17], so that the overall intensity of events 
in D2 is lower than in Dl. As mentioned in Sec- 
tion 2.2.3, a posterior sample from a long run of 
the Gibbs sampler of Fearnhead and Sherlock (2006) 
was used to approximate the true posterior. Figure 
3 shows estimates of the marginal posterior distri- 
bution for (^1,^2) and for (^1,912) for Dl (top) and 
D2 (bottom). 
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Fig. 3. Estimated marginal posteriors for andip2 and for 
i/>i and qi2 from long runs of the Gibbs sampler for datasets 
Dl (top) and D2 (bottom). 

Because the difference in intensity between the 
states is so much larger in Dl than in D2 it is eas- 
ier with Dl than D2 to distinguish the state of the 
underlying Markov chain, and thus the values of 
the Markov and Poisson parameters. Further, in the 
limit of the underlying chain being known precisely, 
for example as ip2 — > oo with ipi finite, and provided 
the priors are independent, the posteriors for the 
Poisson intensity parameters tp\ and ip2 are com- 
pletely independent of each other and of the Markov 
parameters q\2 and q2\- Dependence between the 
Markov parameters is also small, being 0(1/T) (e.g., 
Fearnhead and Sherlock, 2006). 

In Section 4, differences between Dl and D2 will 
be related directly to observed differences in effi- 
ciency of the various RWM algorithms between the 
two datasets. 

3. IMPLEMENTATIONS OF THE RWM: 
THEORY AND PRACTICE 

This section describes several theoretical results 
for the RWM or for MCMC in general. Intuitive 
explanation of the principle behind each result is 
emphasized and the manner in which it informs the 
RWM implementation is made clear. Each algorithm 
was run three times on each of the two datasets. 



3.1 Optimal Scaling of the RWM 

Intuition: Consider the behavior of the RWM as 
a function of the overall scale parameter of the pro- 
posed jump, A, in (1). If most proposed jumps are 
small compared with some measure of the scale of 
variability of the target distribution, then, although 
these jumps will often be accepted, the chain will 
move slowly and exploration of the target distribu- 
tion will be relatively inefficient. If the jumps pro- 
posed are relatively large compared with the tar- 
get distribution's scale, then many will not be ac- 
cepted, the chain will rarely move, and will again ex- 
plore the target distribution inefficiently. This sug- 
gests that given a particular target and form for the 
jump proposal distribution, there may exist a finite 
scale parameter for the proposal with which the al- 
gorithm will explore the target as efficiently as pos- 
sible. These ideas are clearly demonstrated in Fig- 
ure 1 which shows traceplots for a one-dimensional 
Gaussian target explored using a Gaussian proposal 
with scale parameter an order of magnitude smaller 
(a) and larger (c) than is optimal, and (b) with a 
close to optimal scale parameter. 

Theory: Equation (9) gives algorithm efficiency for 
a target with independent and identical (up to a 
scaling) components as a function of the "rescaled" 
scale parameter /i = d 1 / 2 ^ of a Gaussian proposal. 
Equation (10) gives algorithm efficiency for a uni- 
modal elliptically symmetric target explored by a 
spherically symmetric proposal with fi = d 1 ^ 2 Xdk^ / 

k x ■ Efficiencies are therefore optimal at /x~2.38/ 
J 1 / 2 and fi pa 2.38, respectively. These correspond to 
actual scale parameters of respectively 

2.38 _ 2.38fc£ d) 

Arf "Jw and d ~JJ^- 

The equivalence between these two expressions for 
Gaussian data explored with a Gaussian target is 
clear from Section 2.2.4. However, the equations of- 
fer little direct help in choosing a scale parameter 
for a target which is neither elliptical nor possesses 
components which are i.i.d. up to a scale parameter. 
Substitution of each expression into the correspond- 
ing acceptance rate equation, however, leads to the 
same optimal acceptance rate, a pa 0.234. This justi- 
fies the relatively well-known adage that for random 
walk algorithms with a large number of parameters, 
the scale parameter of the proposal should be chosen 
so that the acceptance rate is approximately 0.234. 



LINKING THEORY AND PRACTICE 



9 



On a graph of asymptotic efficiency against accep- 
tance rate (e.g., Roberts and Rosenthal, 2001), the 
curvature near the mode is slight, especially to its 
right, so that an acceptance rate of anywhere be- 
tween 0.2 and 0.3 should lead to an algorithm of 
close to optimal efficiency. 

In practice updates are performed on a finite num- 
ber of parameters; for example, a two-dimensional 
MMPP has four parameters (tp±, ■02, Q12, Q2i)- A block 
update involves all of these, while each update of 
a simple Metropolis-within-Gibbs step involves just 
one parameter. In finite dimensions the optimal ac- 
ceptance rate can in fact take any value between 
and 1. Sherlock and Roberts (2009) provided ana- 
lytical formulas for calculating the ESJD and the 
expected acceptance rate for any proposal and any 
elliptically symmetric unimodal target. In one di- 
mension, for example, the optimal acceptance rate 
for a Gaussian target explored by a Gaussian pro- 
posal is 0.44, while the optimum for a Laplace target 
(tt(x) oc e - ' 21 ') explored with a Laplace proposal is 
exactly a = 1/3. Sherlock (2006) considered several 
simple examples of spherically symmetric proposal 
and target across a range of dimensions and found 
that in all cases curvature at the optimal acceptance 
rate is small, so that a range of acceptance rates is 
nearly optimal. Further, the optimal acceptance rate 
is itself between 0.2 and 0.3 for d > 6 in all the cases 
considered. 

Sherlock and Roberts (2009) also weakened the 
"shell" condition of Section 2.2.4 and considered se- 
quences of spherically symmetric targets for which 
the (rescaled) radius converges to some random vari- 
able R rather than a point mass at 1. It is shown 
that, provided the sequence of proposals still satis- 
fies the shell condition, the limiting optimal accep- 
tance rate is strictly less than 0.234. Acceptance rate 
tuning should thus be seen as only a guide, though a 
guide which has been found to be robust in practice. 

Algorithm 1 (Blk). The first algorithm (Blk) 
used to explore datasets Dl and D2 is a four-dimen- 
sional block updating RWM with proposal Y ~ A^(0, 
A 2 I) and A tuned so that the acceptance rate is ap- 
proximately 0.3. 

3.2 Optimal Scaling of the RWM-Within-Gibbs 

Intuition: Consider first a target either spheri- 
cally symmetric, or with i.i.d. components, and let 
the overall scale of variability of the target be n. 
For full block proposals the optimal scale parameter 



should be 0{r\jdfl 2s ) so that the square of the magni- 
tude of the total proposal is 0(rj 2 ). If a Metropolis- 
within-Gibbs update is to be used with k sub-blocks 
and d* = d/k of the components updated at each 
stage, then the optimal scale parameter should be 

1/2 

larger, 0{n/dj ). However, only one of the k stages 
of the RWM-within-Gibbs algorithm updates any 
given component whereas with k repeats of a block 
RWM that component is updated k times. Consider- 
ing the squared jump distances it is easy to see that, 
given the additivity of squared jump distances, the 
larger size of the RWM-within-Gibbs updates is ex- 
actly canceled by their lower frequency, and so (in 
the limit) there is no difference in efficiency when 
compared with a block update. The same intuition 
applies when comparing a random scan Metropolis- 
within-Gibbs scheme with a single block update. 

Now consider a target for which different compo- 
nents vary on different scales. If sub-blocks are cho- 
sen so as to group together components with sim- 
ilar scales, then a Metropolis-within-Gibbs scheme 
can apply suitable scale paramaters to each block 
whereas a single block update must choose one scale 
parameter that is adequate for all components. In 
this scenario, Metropolis-within-Gibbs updates should 
therefore be more efficient. 

Theory: Neal and Roberts (2006) considered a ran- 
dom scan RWM-within-Gibbs algorithm on a target 
distribution with i.i.d. components and using i.i.d. 
Gaussian proposals all having the same scale param- 
eter Xd = fi/d 1 / 2 . At each iteration a fraction, 7^, of 
the d components are chosen uniformly at random 
and updated block. It is shown [again subject 
to differentiability conditions on /(•)] that the pro- 
cess : = X^ t( q approaches a Langevin diffusion 
with speed 

h 1 {^) = 2 1 ^{-\^ 1 J) 1 / 2 ), 

where 7 := lim^oo 7^. The optimal scaling is there- 
fore larger than for a standard block update (by a 
factor of 7" 1 / 2 ) but the optimal speed and the op- 
timal acceptance rate (0.234) are identical to those 
found by Roberts, Gelman and Gilks (1997). 

Sherlock (2006) considered sequential Metropolis- 
within-Gibbs updates on a unimodal elliptically sym- 
metric target, using spherical proposal distributions 
but allowing different scale parameters for the pro- 
posals in each sub-block. The k sub-blocks are as- 
sumed to correspond to disjoint subsets of the prin- 
cipal axes of the ellipse and updates for each are 
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assumed to be optimally tuned. Efficiency is consid- 
ered in terms of ESEJD and is again found to be 
optimal (as d — > oo) when the acceptance rate for 
each sub-block is 0.234. For equal sized sub-blocks, 
the relative efficiency of the Metr op olis- within- Gibbs 
scheme compared to k optimally scaled single block 
updates is shown to be 

((i/«=) EiM)- 1 

where c 2 j is the mean of the squares of the inverse 
scale parameters for the ith. block. Since r is the ra- 
tio of an arithmetic mean to a harmonic mean, it is 
greater than or equal to 1 and thus the Metropolis- 
within-Gibbs step is always at least as efficient as 
the block Metropolis. However, the more similar the 
blocks, the less the potential gain in efficiency. 

In practice, parameter blocks do not generally cor- 
respond to disjoint subsets of the principal axes of 
the posterior or, in terms of single parameter up- 
dates, the parameters are not generally orthogonal. 
Equation (13) therefore corresponds to a limiting 
maximum efficiency gain, obtainable only when the 
parameter sub-blocks are orthogonal. 

Algorithm 2 (MwG). Our second algorithm 
(MwG) is a sequential Metropolis-within-Gibbs al- 
gorithm with proposed jumps Yi ~ iV(0, A?). Each 
scale parameter is tuned separately to give an accep- 
tance rate of between 0.4 and 0.45 (approximately 
the optimum for a one-dimensional Gaussian target 
and proposal). 

3.3 Tailoring the Shape of a Block Proposal 

Intuition: First consider a two-dimensional target 
with roughly elliptical contours and with the scale 
of variation along one of the principal axes much 
larger than the scale of variation along the other 
(e.g., the two right-hand panels of Figure 3). The 
size of updates from a proposal of the type used in 
Algorithm 1 is constrained by the smaller of the two 
scales of variation. Thus, even when Algorithm 1 is 
optimally tuned, the efficiency of exploration along 
the larger axis depends on the ratio of the two scales 
and so can be arbitrarily low in targets where this 
ratio is large. Now consider a general target with 
roughly elliptical contours and covariance matrix XI. 
It seems intuitively sensible that a "tailored" block 
proposal distribution with the same shape and ori- 
entation as the target will tend to produce larger 
jumps along the target's major axes and smaller 



jumps along its minor axes and should therefore al- 
low for more efficient exploration of the target. 

Theory: Sherlock (2006) considered exploration of 
a unimodal elliptically symmetric target with either 
a spherically symmetric proposal or a tailored ellip- 
tically symmetric proposal in the limit as d— > oo. 
Subject to condition (11) (and a "shell"-like condi- 
tion similar to that mentioned in Section 2.2.4), it is 
shown that with each proposal shape it is in fact pos- 
sible to achieve the same optimal expected squared 
jump distance. However, if a spherically symmet- 
ric proposal is used on an elliptical target, some 
components are explored better than others and in 
some sense the overall efficiency is reduced. This be- 
comes clear on considering the ratio, r, of the ex- 
pected squared Euclidean jump distance for an op- 
timal spherically symmetric proposal to that of an 
optimal tailored proposal. Sherlock (2006) showed 
that for a sequence of targets, where the target with 
dimension d has elliptical axes with inverse scale pa- 
rameters Crfi,. . . , Cd t d, the limiting ratio is 

lim^ 00 ((l/d)Eti^ 2 )" 1 
r = j . 

lim^oo (1/d) £ i= i c 2 d i 

The numerator is the limiting harmonic mean of the 
squared inverse scale parameters, which is less than 
or equal to their arithmetic mean (the denomina- 
tor), with equality if and only if (for a given d) all 
the Cd,i are equal. Roberts and Rosenthal (2001) ex- 
amined similar relative efficiencies but for targets 
and proposals with independent components with 
inverse scale parameters C sampled from some dis- 
tribution. In this case the derived measure of relative 
efficiency is the relative speeds of the diffusion limits 
for the first component of the target 

E[C 2 ] ' 

This is again less than or equal to 1, with equal- 
ity when all the scale parameters are equal. Hence 
efficiency is indeed directly related to the relative 
compatibility between target and proposal shapes. 

Furthermore, Bedard (2008) showed that if a pro- 
posal has i.i.d. components yet the target (assumed 
to have independent components) is wildly asym- 
metric, as measured by (11), then the limiting op- 
timal acceptance rate can be anywhere between 
and 1. However, even at this optimum, some com- 
ponents will be explored infinitely more slowly than 
others. 
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In practice the shape X! of the posterior is not 
known and must be estimated, for example by nu- 
merically finding the posterior mode and the Hes- 
sian matrix H at the mode, and setting S = H . 
We employ a simple alternative which uses an earlier 
MCMC run. 

Algorithm 3 (BlkShp). Our third algorithm 
first uses an optimally scaled block RWM algorithm 
(Algorithm 1), which is run for long enough to ob- 
tain a "reasonable" estimate of the covariance from 
the posterior sample. A fresh run is then started and 
tuned to give an acceptance rate of about 0.3 but 
using proposals 

Y~ AT(0,A 2 £). 

For each dataset, so that our implementation would 
reflect likely statistical practice, each of the three 
replicates of this algorithm estimated the S ma- 
trix from iterations 1000-2000 of the corresponding 
replicate of Algorithm 1 (i.e., using 1000 iterations 
after "burn in"). In all, therefore, six different vari- 
ance matrices were used. 

3.4 Improving Tail Exploration 

Intuition: A posterior with relatively heavy poly- 
nomial tails such as the one-dimensional Cauchy dis- 
tribution has considerable mass some distance from 
the origin. Proposal scalings which efficiently ex- 
plore the body of the posterior are thus too small to 
explore much of the tail mass in a "reasonable" num- 
ber of iterations. Further, polynomial tails become 
flatter with distance from the origin so that for unit 
vector u, 7r(x + Au)/-7r(x) — > 1 as ||x||2 — > oo. Hence 
the acceptance rate for a random walk algorithm 
approaches 1 in the tails, whatever the direction of 
the proposed jump. The algorithm therefore loses al- 
most all sense of the direction to the posterior mass. 

Theory: Roberts (2003) brought together litera- 
ture relating the tails of the <i-dimensional posterior 
and proposal to the ergodicity of the Markov chain 
and hence its convergence properties. Three impor- 
tant cases are noted: 

(1) If 3s > such that 7r(x) oc e _s " x " 2 , at least out- 
side some compact set, then the random walk 
algorithm is geometrically ergodic. 

(2) If 3r > such that the tails of the proposal are 



-(r+o 



and if 



bounded by some multiple of \\x\ 

7r(x) oc ||x|| 2 ^ r+ ^ j at least outside some compact 
set, then the algorithm is polynomially ergodic 
with rate r/2. 



(3) If 3r > and i] £ (0,2) such that 7r(x) oc 

||x|| 2 ( r+cl \ at least for large enough x, and the 

proposal has tails g(x) oc ||x|| 2 ^ d+T1 \ then the al- 
gorithm is polynomially ergodic with rate r/rj. 

Thus posterior distributions with exponential or 
lighter tails lead to a geometrically ergodic Markov 
chain, whereas polynomially tailed posteriors can 
lead to polynomially ergodic chains, and even this 
is only guaranteed if the tails of the proposal are 
at least as heavy as the tails of the posterior. How- 
ever, by using a proposal with tails so heavy that 
it has infinite variance, the polynomial convergence 
rate can be made as large as is desired. 

Algorithm 4 (BlkShpCau). Our fourth algo- 
rithm is identical to BlkShp but samples the pro- 
posed jump from the heavy-tailed multivariate 
Cauchy. Proposals are generated by simulating V ~ 
N(0,t) and Z ~ JV(0, 1) and setting Y* = "V/Z. 
No acceptance rate criteria exist for proposals with 
infinite variance and so the optimal scaling param- 
eter for this algorithm was found (for each dataset 
and S) by repeating several small runs with differ- 
ent scale parameters and noting which produced the 
best ACT's for each dataset. 

Algorithm 5 (BlkShpMul). The fifth algorithm 
relies on the fact that taking logarithms of param- 
eters shifts mass from the tails to the center of the 
distribution. It uses a random walk on the posterior 
of 9 := (log -01, log ■02, log 912, log g 2 i)- Shape matri- 
ces X were estimated as for Algorithm 3, but using 
the logarithms of the posterior output from Algo- 
rithm 1. In the original parameter space this algo- 
rithm is equivalent to a proposal with components 
X* = X{e Yi and so has been called the multiplicative 
random walk (see, e.g., Dellaportas and Roberts, 
2003). In the original parameter space the accep- 
tance probability is 



ax, x 



mm 



1 nUM**) 



Since the algorithm is simply an additive random 
walk on the log parameter space, the usual accep- 
tance rate optimality criteria apply. 

A logarithmic transformation is clearly only ap- 
propriate for positive parameters and can in fact 
lead to a heavy left-hand tail if a parameter (in the 
original space) has too much mass close to zero. The 
transformation #j = sign(#j) log(l + \6i\) circumvents 
both of these problems. 
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3.5 Additional Strategies 

Scaling and shaping of the proposal, the choice 
of proposal distribution (here Gaussian or Cauchy) , 
and an informed choice between RWM and 
Metropolis-within-Gibbs updates can all lead to a 
more efficient algorithm. Building on these possibil- 
ities, we now consider two further mechanisms for 
improving efficiency: adaptive MCMC, and utilizing 
problem-specific knowledge. 

3.5.1 Adaptive MCMC 

Intuition: Algorithm 3 used the output from a 
previous MCMC run to estimate the shape Matrix 
X. An overall scaling parameter was then varied to 
give an acceptance rate of around 0.3. With adap- 
tive MCMC a single chain is run, and this chain 
gradually alters its own proposal distribution (e.g., 
changing S), by learning about the posterior from 
its own output. This simple idea has a major poten- 
tial pitfall, however. 

If the algorithm is started away from the main 
posterior mass, for example in a tail or a minor 
mode, then it initially learns about that region. It 
therefore alters the proposal so that it efficiently 
explores this region of minor importance. Worse, 
in so altering the proposal the algorithm may be- 
come even less efficient at finding the main posterior 
mass, remain in an unimportant region for longer, 
and become even more influenced by that unimpor- 
tant region. Since the transition kernel is continu- 
ally changing, potentially with this positive feedback 
mechanism, it is no longer guaranteed that the over- 
all stationary distribution of the chain is vr(-). 

A simple solution is so-called finite adaptation 
wherein the algorithm is only allowed to evolve for 
the first no iterations, after which time the transi- 
tion kernel is fixed. Such a scheme is equivalent to 
running a shorter "tuning" chain and then a longer 
subsequent chain (e.g., Algorithm 3). If the tuning 
portion of the chain has only explored a minor mode 
or a tail, this still leads to an inefficient algorithm. 
We would prefer to allow the chain to eventually cor- 
rect for any errors made at early iterations and yet 
still lead to the intended stationary distribution. It 
seems sensible that this might be achieved provided 
changes to the kernel become smaller and smaller 
as the algorithm proceeds and provided the above- 
mentioned positive feedback mechanism can never 
pervert the entire algorithm. 

Theory: At the nth iteration let T n represent the 
choice of transition kernel; for the RWM it might 



represent the current shape matrix S and the over- 
all scaling A. Denote the corresponding transition 
kernel Pr n (x, •)• Roberts and Rosenthal (2007) de- 
rived two conditions which together guarantee con- 
vergence to the stationary distribution. A key con- 
cept is that of diminishing adaptation, wherein 
changes to the kernel must become vanishingly small 
as n — > oo, 

sup||Pr n+1 (x,-) -Pr n (x,-)||i -^0 as n-^oo. 

X 

A second containment condition considers the e- 
convergence time under repeated application of a 
fixed kernel, 7, and starting point x, 

M £ (x, 7 ) := inf{n > 1 : ||P^(x, •) - vr(-)||i < e}, 

n 1 

and requires that for all 5 > there is an N such 
that for all n 

p(M £ (x n , r n ) < iv|x = x , r = 70) > 1 - s. 

The containment condition is difficult to check in 
practice; some criteria are provided in the work of 
Bai, Roberts and Rosenthal (2009). 

Adaptive MCMC is a highly active research area 
and so we confine ourselves to an adaptive version 
of Algorithm 5. Roberts and Rosenthal (2010) de- 
scribed an adaptive RWM algorithm for which the 
proposal at the nth iteration is sampled from a mix- 
ture of an adaptive N(0, ^2.38 2 X! n ) and a nonadap- 
tive Gaussian distribution; here S n is the variance 
matrix calculated from the previous n — 1 iterations 
of the scheme. Changes to the variance matrix are 
0(l/n) at the nth iteration and so the algorithm 
satisfies the diminishing adaptation condition. 

Choice of the overall scaling factor 2.38 2 /d fol- 
lows directly from the optimal scaling limit results 
reviewed in Section 3.1, with J = 1 or k x = k y . In 
general, therefore, a different scaling might be ap- 
propriate, and so our scheme extends that of Roberts 
and Rosenthal (2010) by allowing the overall scaling 
factor to adapt. 

Algorithm 6 (BlkAdpMul). Our adaptive 
MCMC algorithm is a block multiplicative random 
walk which samples jump proposals on the 
log-posterior from the mixture 

(N(0,m 2 n t n ) w.p.1-5, 
Y ~ jiV fo^Agl) w.p. 5. 

Here 5 = 0.05, d = 4, and 5] n is the variance matrix 
of the logarithms of the posterior sample to date. 
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A few minutes were spent tuning the block multi- 
plicative random walk with proposal variance tAqI 
to give at least a reasonable value for Ao (acceptance 
rate « 0.3), although this is not strictly necessary. 

To ensure a sensible nonsingular S n , proposals 
from the adaptive part of the mixture were only 
allowed once there had been at least 10 proposed 
jumps accepted. The overall scaling factor for the 
adaptive part of the kernel, m n , was initialized to 
itiq = 2.38/ d 1 / 2 and an adaptation quantity A = 
mo/100 was defined. If iteration i was from the non- 
adaptive part of the kernel, then m^i <— mf, other- 
wise: 

• If the proposal was rejected, then mj+i <— rrii — 
A/i 1 / 2 . 

• If the proposal was accepted, then mj+i ^— rrii + 
2.3A/i 1 /2. 

This leads to an equilibrium acceptance rate of 
1/3.3 ps 30%, the target acceptance rate for the other 
block updating algorithms which use Gaussian pro- 
posals (Algorithms 1, 3, and 5). Changes to m are 
scaled by i 1 ^ 2 since they must be large enough to 
adapt to changes in the covariance matrix yet small 
enough that an equilibrium value is established rel- 
atively quickly. As with the variance matrix, such 
a value would then only change noticeably if there 
were consistent evidence that it should. 

3.5.2 Utilizing problem- specific knowledge 
Intuition: Algorithms are always applied to spe- 
cific datasets with specific forms for the likelihood 
and prior. Combining techniques such as optimal 
scaling and shape adjustment with problem-specific 
knowledge can often markedly improve efficiency. In 
the case of the MMPP we define a reparameteri- 
zation based on the intuition that for an MMPP 
with tpi ps ip2 (as in D2) the data contain a great 
deal of information about the average intensity but 
relatively little information about the difference be- 
tween the intensities. 

Theory: For a two-dimensional MMPP define an 
overall transition intensity, stationary distribution, 
mean intensity at stationarity, and a measure of the 
difference between the two event intensities as fol- 
lows: 

9 : =9i2 + 92i, v:= -[921,912], 
(14) 9 

ip := v ip and o := = 

ip. 



Let t Q bs be the total observation time and t the vec- 
tor of observed event times. If the Poisson event in- 
tensities are similar, 5 is small, and Taylor expansion 
of the log-likelihood in 5 (see Sherlock, 2006) gives 

l{ip,q,5, z/i) 

(15) = nlog^ - ?/^obs + 25 2 u 1 u 2 f{i)t, qt) 

+ 5 z v lV2 {v 2 - ut)g^t,qt) + 0{5 A ) 

for some /(■, •) and g(-, •). Consider a reparameteri- 
zation from (ipx, ip 2 , 912, 92i) to (tp,q,a,(3) with 

(16) a:=25{u 1 u 2 ) 1 / 2 and p := 8{v 2 - vx). 

Parameters ip; q and a; and /3 (in this order) capture 
decreasing amounts of variation in the log-likelihood 
and so, conversely, it might be anticipated that there 
be corresponding decreasing amounts of information 
about these parameters contained in the likelihood. 
Hence very different scalings might be required for 
each. 

Algorithm 7 (MwGRep). A Metropolis- within- 
Gibbs update scheme was applied to the reparame- 
terization (ip,q,a, f3). A multiplicative random walk 
was used for each of the first three parameters (since 
they are positive) and an additive update was used 
for j3. Scalings for each of the four parameters were 
chosen to give acceptance rates of between 0.4 and 
0.45. 

Algorithm 8 (MwGRepCau). Our final algo- 
rithm is identical to MwGRep except that additive 
updates for (3 are proposed from a Cauchy distribu- 
tion. The Cauchy scaling was optimized to give the 
best ACT over the first 1000 iterations. 

4. RESULTS 

The eight algorithms described in Section 3 are 
summarized in Table 1. The table includes two fur- 
ther algorithms, an independence sampler (Algorithm 
9: IndShp), and the Gibbs sampler of Fearnhead and 
Sherlock (2006) (Algorithm 10: Gibbs); these were 
included to benchmark the efficiency of RWM algo- 
rithms against some sensible alternatives. The inde- 
pendence sampler used a multivariate t distribution 
with five degrees of freedom and the same set of 
covariance matrices as Algorithm 3. 

Each RWM variation was tested against datasets 
Dl and D2 as described in Section 2.3.1. For each 
dataset, each algorithm was started from the known 
"true" parameter values and was run three times 
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with three different random seeds (referred to as 
Replicates 1-3). All algorithms were run for 11,000 
iterations; a burn in of 1000 iterations was sufficient 
in all cases. 

Priors were independent and exponential with 
means the known "true" parameter values. The like- 
lihood of an MMPP with maximum and minimum 
Poisson intensities t/) mifi! and ip m i n and with n events 
observed over a time window of length t Q ^ s is bounded 
above by ■0™ ax e~^ mini ° bs . In this article only MMPP 
parameters and their logarithms are considered for 
estimation. Since exponential priors are employed 
the parameters and their logarithms therefore have 
finite variance, and geometric ergodicity is guaran- 
teed. 

The accuracy of posterior simulations is assessed 
via QQ plot comparison with the output from a very 
long run of a Gibbs sampler (see Section 2.2.3). QQ 
plots for almost all replicates were almost entirely 
within their 95% confidence bounds. Figure 4 shows 
such plots for Algorithms 1-3 and 9 (the indepen- 
dence sampler) on dataset D2 (Replicate 1). In gen- 
eral these combinations produced the least accurate 
performance, and only with the independence sam- 
pler is there reason to doubt that the posterior sam- 
ple is a reasonable representation of the true poste- 
rior. The relatively poor performance on D2 of Al- 
gorithms 1-3 and especially Algorithm 9 is repeated 
for the other two replicates. The third replicate of 
Algorithm 4 on D2 also showed an imperfect fit in 
the tails. 

The integrated ACT was estimated for each pa- 
rameter and each replicate using the final 10,000 it- 
erations from that replicate. Calculation of the like- 
lihood is by far the most computationally intensive 



operation (taking approximately 99.8% of the to- 
tal CPU time) and is performed four times for each 
Metropolis-within-Gibbs-iteration (once for each pa- 
rameter) and only once for each block update; a sim- 
ilar calculation is performed once for each update of 
the Gibbs sampler. To give a truer indication of over- 
all efficiency the ACTs for each Metropolis-within- 
Gibbs replicate have therefore been multiplied by 4. 
Table 2 shows the mean adjusted ACT for each algo- 
rithm, parameter, and dataset. For each set of three 
replicates most of the ACTs lay within 20% of their 
mean, and for the exceptions (Blk and BlkShpCau 
for datasets Dl and D2, and BlkShp and BlkShp- 
Mul for dataset D2) full sets of ACTs are given in 
Table 3 in the Appendix. 

In general all algorithms performed better on Dl 
than on D2 because, as discussed in Section 2.3.1, 
dataset Dl contains more information on the pa- 
rameters than D2; it therefore has lighter tails and 
is more easily explored by the chain. 

The simple block additive algorithm using Gaus- 
sian proposals with variance matrix proportional to 
the identity matrix (Blk) performs relatively poorly 
on both datasets. In absolute terms there is much 
less uncertainty about the transition intensities q\2 
and q2i (both are close to 1) than in the Poisson 
intensities tp\ (10) and ip2 (17 for Dl and 30 for 
D2) since the variance of the output from a Pois- 
son process is proportional to its value. The opti- 
mal single-scale parameter necessarily tunes to the 
smallest variance and hence explores q\2 and q2i 
much more efficiently than ipi and ip2- 

Overall performance improves enormously once 
block proposals are from a Gaussian with approx- 
imately the correct shape (BlkShp). The efficiency 



Table 1 

Summary of the algorithms used in this paper 



No. 


Abbreviation 


Description 


1 


Blk 


Block additive with tuned proposal JV(0,A 2 I). 


2 


MwG 


Sequential additive with tuned proposals iV(0, A 2 ) (i = 1, . . . ,4). 
Block additive with tuned proposal N(0, A 2 E). 


3 


BlkShp 


4 


BlkShpCau 


Block additive with tuned proposal Cauchy(0, A 2 £). 


5 


BlkShpMul 


Block multiplicative with tuned proposal N(0, A 2 £). 


6 


BlkAdpMul 


Block multiplicative with adaptively tuned mixture proposal. 


7 


MwGRep 


Sequential multiplicative/additive Gaussian; reparameterization. 


8 


MwGRepCau 


Sequential multiplicative Gaussian and additive Cauchy; reparameterization. 


9 


IndShp 


Block independence sampler with tuned proposal £5(0, £). 


10 


Gibbs 


Hidden data Gibbs sampler of Fearnhead and Sherlock (2006). 
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Fig. 4. QQ plots for algorithms Blk, MwG, BlkShp, and IndShp, on D2 (Replicate 1). Dashed lines are approximate 95% 
confidence limits obtained by repeated sampling from iterations 1000 to 100,000 of a Gibbs sampler run; sample sizes were 
10,000/ ACT, which is the effective sample size of the data being compared to the Gibbs run. 



of the Metropolis-within-Gibbs algorithm with ad- 
ditive Gaussian updates (MwG) lies somewhere be- 
tween the efficiencies of Blk and BlkShp but the im- 
provement over Blk is larger for dataset Dl than for 
dataset D2. As discussed in Section 2.3.1 the pa- 
rameters in Dl are more nearly independent than 
the parameters in D2. Thus for dataset Dl the prin- 
cipal axes of an elliptical approximation to the pos- 
terior are more nearly parallel to the cartesian axes. 
Metropolis-within-Gibbs updates are (by definition) 
parallel to each of the cartesian axes and so can 



make large updates almost directly along the major 
axis of the ellipse for dataset Dl. 

For the heavy-tailed posterior of dataset D2 we 
would expect block updates resulting from a Cauchy 
proposal (BlkShpCau) to be more efficient than those 
from a Gaussian proposal. However, for both datasets 
Cauchy proposals are slightly less efficient than Gauss- 
ian proposals. It is likely that the heaviness of the 
Cauchy tails leads to more proposals with at least 
one negative parameter, such proposals being auto- 
matically rejected. Moreover, S represents the main 
posterior mass, yet some large Cauchy jump propos- 
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Table 2 

Mean estimated integrated autocorrelation time for the four parameters over three independent replicates for datasets T)l and 

D2 



Algorithm 






Dl 








D2 




i>x 




log(<7i 2 ) 


log(«72i) 






log(«7i 2 ) 


log(<?2i) 


Blk 


66 


126 


15 


19 


176 


175 


80 


70 


MwG* 


22 


22 


33 


33 


103 


90 


114 


99 


BlkShp 


13 


18 


13 


15 


46 


25 


37 


36 


BlkShpCau 


19 


32 


25 


24 


63 


50 


56 


38 


BlkShpMul 


13 


17 


13 


15 


33 


26 


22 


16 


BlkAdpMul 


12 


12 


14 


14 


20 


20 


17 


23 


MwGRep* 


13 


14 


32 


44 


20 


23 


23 


21 


MwGRepCau* 


14 


15 


37 


42 


24 


233 


25 


23 


IndShp+ 


3.7 


o..-) 


3.5 


3.7 










Gibbs 


4.2 


3.2 


5.7 


5.9 


26 


19 


32 


27 



Notes: 'Estimates for MwG replicates have been multiplied by 4 to provide figures comparable with full block updates in 
terms of CPU time. + ACT results for the independence sampler for D2 are irrelevant since the MCMC sample was not an 
accurate representation of the posterior. 



als from this mass will be in the posterior tail. It may 
be that E does not accurately represent the shape 
of the posterior tails. 

Multiplicative updates (BlkShpMul) make little 
difference for Dl, but for the relatively heavy-tailed 
D2 there is a definite improvement over BlkShp. 
The adaptive multiplicative algorithm (BlkAdpMul) 
is slightly more efficient still, since the estimated 
variance matrix and the overall scaling are refined 
throughout the run. 

As was noted earlier in this section, due to our 
choice of exponential priors the quantities estimated 
in this article have exponential or lighter posterior 
tails and so all the nonadaptive algorithms in this 
article are geometrically ergodic. The theory in Sec- 
tion 3.4 suggests ways to improve tail exploration 
for polynomially ergodic algorithms and so, strictly 
speaking, need not apply here. However, the expo- 
nential decay only becomes dominant some distance 
from the posterior mass, especially for dataset D2. 
Polynomially increasing terms in the likelihood en- 
sure that initial decay is slower than exponential, 
and that the multiplicative random walk is there- 
fore more efficient than the additive random walk. 

The adaptive overall scaling m showed variability 
of O(0. 1) over the first 1000 iterations after which 
time it quickly settled down to 1.2 for all three repli- 
cates on Dl and to 1.1 for all three replicates on D2. 
Both of these values are very close to the scaling of 
1.19 that would be used for a four-dimensional up- 
date in the scheme of Roberts and Rosenthal (2010). 




500 1000 1500 2000 500 1000 1500 2000 




Fig. 5. Traceplots for the first 2000 iterations of BlkAdpMul 
on dataset D2 (Replicate 1). 

The algorithm similarly learned very quickly about 
the variance matrix X, with individual terms set- 
tling down after less than 2000 iterations, and with 
exploration close to optimal after less than 500 it- 
erations. This can be seen clearly in Figure 5 which 
shows traceplots for the first 2000 iterations of the 
first replicate of BlkAdpMul on D2. 

The adaptive algorithm uses its own history to 
learn about d{d + l)/2 covariance terms and a best 
overall scaling. One would therefore expect that the 
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larger the number of parameters, d, the more itera- 
tions are required for the scheme to learn about all 
of the adaptive terms and hence reach a close to op- 
timal efficiency. To test this a dataset (D3) was sim- 
ulated from a three-dimensional MMPP with ip = 
[10, 17, 30]* and q u = qrs = 921 = <?23 = <?3i = 932 = 
0.5. The following adaptive algorithm was then run 
three times, each for 20,000 iterations. 

Algorithm 6b [BlkAdpMul(b)]. This adaptive 
algorithm is identical to BlkAdpMul (with d = 9) 
except that no adaptive proposals were used until 
at least 100 nonadaptive proposals had been ac- 
cepted, and that if an adaptive proposal was ac- 
cepted then the overall scaling was updated with 
m <— 77i + 3A/i 1//2 so that the equilibrium acceptance 
rate was approximately 0.25. 

Figure 6 shows the evolution of four of the 46 
adaptive parameters (Replicate 1). All parameters 
seem close to their optimal values after 10,000 it- 
erations, although covariance parameters appear to 
be still slowly evolving even after 20,000 iterations. 
In contrast, traceplots of parameters (not shown) 
reveal that the speed of exploration of the poste- 
rior is close to its final optimum after only 1500 
iterations. This behavior was repeated across the 
other two replicates, indicating that, as with the 
two-dimensional adaptive and nonadaptive runs, 
even a very rough approximation to the variance 
matrix improves efficiency considerably. Over the 
full 20,000 iterations, all three replicates showed a 
definite multimodality with A2 often close to either 
Ai or A3, indicating that the data might reasonably 
be explained by a two-dimensional MMPP. In all 
three replicates the optimal scaling settled between 
0.25 and 0.3, noticeably lower than the Roberts and 
Rosenthal (2010) value of 2.38/^9. With reference 
to Section 3.1 this is almost certainly due to the 
roughness inherent in a multimodal posterior. 

The reparameterization of Section 3.5.2 was de- 
signed for datasets similar to D2, and on this dataset 
the resulting Metropolis-within-Gibbs algorithm 
(MwGRep) is at least as efficient as the adaptive 
multiplicative random walk. On dataset Dl, how- 
ever, exploration of q±2 and 921 is arguably less ef- 
ficient than for the Metropolis-within-Gibbs algo- 
rithm with the original parameter set. The lack of 
improvement when using a Cauchy proposal for j3 
(MwGRepCau) suggests that this inefficiency is not 
due to poor exploration of the potentially heavy- 
tailed /3. Further investigation in the (ip,q,a,(3) pa- 
rameter space showed that for dataset Dl only q was 
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Fig. 6. Plots of the adaptive scaling parameter m and 
three estimated covariance parameters Var^i], Var[gi2], and 
Cov[V>i, 912] for BlkAdpMul(b) on dataset D3 (Replicate 1). 

explored efficiently; the posteriors of ij) and (3 were 
strongly positively correlated (p~ 0.8), and both ip 
and /3 were strongly negatively correlated with a 
(p ~ —0.65). Posterior correlations were small \p\ < 
0.3 for all parameters with dataset D2 and for all 
correlations involving q for dataset Dl. 

The optimal scaling for the one-dimensional addi- 
tive Cauchy proposal in MwGRepCau was approx- 
imately two thirds of the optimal scaling for the 
one-dimensional additive Gaussian proposal in Mw- 
GRep. In four dimensions the ratio was approxi- 
mately one half. These ratios allow the Cauchy pro- 
posals to produce similar numbers of small to medium 
sized jumps to the Gaussian proposals. 

The independence sampler is arguably the most 
efficient of all of the algorithms considered for Dl. 
However, as discussed earlier in this section, there 
are doubts about the accuracy of its exploration of 
D2. Mengersen and Tweedie (1996) showed that an 
independence sampler is uniformly ergodic if and 
only if the ratio of the proposal density to the tar- 
get density is bounded below, and that one minus 
this ratio gives the geometric rate of convergence. 
To ensure the lower bound it is advisable to propose 
from a relatively heavy-tailed distribution, such as 
the tx> used here. The problem in this instance arises 
because dataset D2 could, just possibly, have been 
generated by a single Poisson process with inten- 
sity tjj ~ (ipx +^2)/2. The resulting minor mode (or, 
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more precisely, ridge) is some distance from the cen- 
ter of the distribution, resulting in a low ratio of 
proposal and target densities. 

The Gibbs sampler of Fearnhead and Sherlock 
(2006) is accurate, with its efficiency directly re- 
lated to the amount of information about the hidden 
Markov chain that is available from the data (Sher- 
lock, 2006). Thus for Dl the Gibbs sampler is more 
efficient than the best RWM algorithms, but this is 
not the case for D2. 

5. DISCUSSION 

We have described the theory and intuition behind 
a number of techniques for improving the efficiency 
of random walk Metropolis algorithms and tested 
these on two data sets generated from Markov mod- 
ulated Poisson processes (MMPPs). Tests on these 
datasets also showed a sensibly implemented RWM 
to be at least as good as some of the other avail- 
able MCMC algorithms. Some RWM implementa- 
tions were uniformly successful at improving effi- 
ciency, while for others success depended on the 
shape and/or tails of the posterior. All of the un- 
derlying concepts discussed here are quite general 
and easily applied to statistical models other than 
the MMPP. 

Simple acceptance rate tuning to obtain the op- 
timal overall variance term for a symmetric Gaus- 
sian proposal can increase efficiency by many orders 
of magnitude. However, with our datasets, even af- 
ter such tuning, the RWM algorithm was very inef- 
ficient. The effectiveness of the sampling increased 
enormously once the shape of the posterior was taken 
into account by proposing from a Gaussian with 
variance proportional to an estimate of the poste- 
rior variance. For Algorithms 3, 4, and 5 the poste- 
rior variance was estimated through a short "train- 
ing run" — the first 1000 iterations after burn in of 
Algorithm 1. 

As expected, use of the "multiplicative random 
walk" (Algorithm 5), a random walk on the poste- 
rior of the logarithm of the parameters, improved 
efficiency most noticeably on the posterior with the 
heavier tails. However, contrary to expectation, even 
on the heavier tailed posterior an additive Cauchy 
proposal (Algorithm 4) was, if anything, less efficient 
than a Gaussian. Tuning of Cauchy proposals was 
also more time-consuming since simple acceptance 
rate criteria could not be used. 

Algorithm 6 combined the successful strategies of 
optimal scaling, shape tuning, and transforming the 



data, to create a multiplicative random walk which 
learned the most efficient shape and scale parame- 
ters from its own history as it progressed. This adap- 
tive scheme was easy to implement and was arguably 
the most efficient RWM for each of the datasets. A 
slight variant of this algorithm was used to explore 
the posterior of a three-dimensional MMPP, and 
showed that in higher dimensions such algorithms 
take longer to discover close to optimal values for the 
adaptive parameters. These runs also confirmed the 
finding for the two-dimensional MMPP that RWM 
efficiency improves enormously with knowledge of 
the posterior variance, even if this knowledge is only 
approximate. For a multimodal posterior such as 
that found for the three-dimensional MMPP it might 
be argued that a different variance matrix should be 
used for each mode. Such "regionally adaptive" al- 
gorithms present additional problems, such as the 
definition of the different regions, and are discussed 
further by Roberts and Rosenthal (2010). 

Metropolis-within-Gibbs updates performed bet- 
ter when the parameters were close to orthogonal, 
at which point the algorithms were almost as effi- 
cient as an equivalent block updating algorithm with 
tuned shape matrix. The best Metropolis-within- 
Gibbs scheme for dataset D2 arose from a new repa- 
rameterization devised specifically for the 
two-dimensional MMPP with parameter orthogo- 
nality in mind. On D2 this performed nearly as well 
as the best scheme, the adaptive multiplicative ran- 
dom walk. 

The adaptive schemes discussed here provide a 
significant step toward a goal of completely auto- 
mated algorithms. However, as already discussed, 
for d model-parameters, a posterior variance ma- 
trix has 0(d 2 ) components. Hence the length of any 
"training run" or of the adaptive "learning period" 
increases quickly with dimension. For high dimen- 
sion it is therefore especially important to utilize to 
the full any problem-specific knowledge that is avail- 
able so as to provide as efficient a starting algorithm 
as possible. 

APPENDIX: RUNS WITH HIGHLY VARIABLE 
ACTS 

Three replicates were performed for each dataset 
and algorithm, and ACTs are summarized by their 
mean in Table 2. However, for certain combinations 
of the algorithms and datasets the ACTs varied con- 
siderably; full sets of ACTs for these replicates are 
given in Table 3. 
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Table 3 

Estimated ACT for the four parameters, on three independent replicates for Blk and 
BlkShpCau on dataset HI and Blk, BlkShp, BlkShpC'au, and BlkShpMul on dataset 



Algorithm 


"01 


'02 


log(«7i 2 ) 


log(<72i) 


Blk (Dl) 


59,64,75 


120,155,104 


12,15,17 


19,21,17 


BlkShpCau (Dl) 


28,16,12 


36,29,31 


20,20,35 


26,23,24 


Blk (D2) 


121,259,146 


107,262,157 


41,139,61 


51,110,48 


BlkShp (D2) 


54,51,34 


23, 24, 29 


40,45,27 


50,35,23 


BlkShpCau (D2) 


46,51,92 


46,57,48 


31,42,94 


39,41,34 


BlkShpMul (D2) 


53,24,23 


22,33,25 


20,23,24 


17,18,13 
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